
RefactorBench: Evaluating Stateful Reasoning in Language Agents Through Code

Gautam, Dhruv, Garg, Spandan, Jang, Jinu, Sundaresan, Neel, Moghaddam, Roshanak Zilouchian

arXiv.org Artificial Intelligence

Recent advances in language model (LM) agents and function calling have enabled autonomous, feedback-driven systems to solve problems across various digital domains. To better understand the unique limitations of LM agents, we introduce RefactorBench, a benchmark consisting of 100 large handcrafted multi-file refactoring tasks in popular open-source repositories. Solving tasks within RefactorBench requires thorough exploration of dependencies across multiple files and strong adherence to relevant instructions. Each task is defined by three natural language instructions of varying specificity, and tasks are mutually exclusive, allowing longer combined tasks to be composed on the same repository. Baselines on RefactorBench reveal that current LM agents struggle with simple compositional tasks, solving only 22% of tasks with base instructions, in contrast to a human developer who, under short time constraints, solves 87%. Through trajectory analysis, we identify various unique failure modes of LM agents and further explore the failure mode of tracking past actions. By adapting a baseline agent to condition on representations of state, we achieve a 43.9% improvement in solving RefactorBench tasks. We further extend our state-aware approach to encompass entire digital environments and outline potential directions for future research. RefactorBench aims to support the study of LM agents by providing a set of real-world, multi-hop tasks within the realm of code.

"Repetition is the root of all software evil" -- Martin Fowler

Large language models (LLMs) have been quickly acquiring new capabilities (Bubeck et al., 2023), leading toward the adoption of AI-powered systems in various formats and domains. The increasing use of LLM-powered tools like GitHub Copilot has greatly improved the capability of developers in software development tasks (Peng et al., 2023). More recently, an emphasis on multi-step execution through LLM feedback loops has unlocked the ability to solve harder problems within a variety of fields (Reed et al., 2022; Sumers et al., 2024; Yao & Narasimhan, 2023), including parts of software engineering. This new paradigm of solving larger software tasks has led to the construction of a variety of new automated software engineering (ASE) systems, most of them structured as LM agents (Wang et al., 2024c; Cognition.ai). Evaluations for such systems are currently drawn largely from real-world data on GitHub (Jimenez et al., 2024; LaBash et al., 2024). While GitHub is the strongest open-source signal for software engineering tasks at scale, its snapshot nature makes it inherently noisy, requiring heavy filtering and validation testing for reliable evaluations (Chowdhury et al., 2024; Bowman & Dahl, 2021).
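The state-conditioning idea summarized above can be pictured as an agent loop that carries a compact record of its own past actions into every prompt, rather than relying only on the latest tool output. The sketch below is a minimal illustration under assumed names (`AgentState`, `call_llm`, the `edit_file`/`finish` tools); it is not RefactorBench's or the paper's actual interface.

```python
# Hypothetical sketch of a state-aware agent loop: each step is conditioned on
# a summary of past actions and touched files. All names are illustrative.
from dataclasses import dataclass, field


@dataclass
class AgentState:
    """Running record of what the agent has already done."""
    edited_files: set = field(default_factory=set)
    actions: list = field(default_factory=list)

    def summary(self) -> str:
        # Compact textual representation of state injected into each prompt.
        return (f"Files edited so far: {sorted(self.edited_files)}\n"
                f"Recent actions: {self.actions[-10:]}")


def call_llm(prompt: str) -> dict:
    # Placeholder for a real LM call; returns a tool invocation.
    return {"tool": "finish", "args": {}}


def run_state_aware_agent(task: str, max_steps: int = 20) -> AgentState:
    state = AgentState()
    for _ in range(max_steps):
        prompt = f"Task: {task}\n\nCurrent state:\n{state.summary()}\n\nNext action?"
        action = call_llm(prompt)
        if action["tool"] == "finish":
            break
        # Record the action so later steps can avoid repeating or undoing it.
        state.actions.append(action)
        if action["tool"] == "edit_file":
            state.edited_files.add(action["args"].get("path", ""))
    return state


if __name__ == "__main__":
    final = run_state_aware_agent("Rename a helper function across all modules")
    print(final.summary())
```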


Statistical Rejection Sampling Improves Preference Optimization

Liu, Tianqi, Zhao, Yao, Joshi, Rishabh, Khalman, Misha, Saleh, Mohammad, Liu, Peter J., Liu, Jialu

arXiv.org Artificial Intelligence

Improving the alignment of language models with human preferences remains an active research challenge. Previous approaches have primarily utilized Reinforcement Learning from Human Feedback (RLHF) via online RL methods such as Proximal Policy Optimization (PPO). Recently, offline methods such as Sequence Likelihood Calibration (SLiC) and Direct Preference Optimization (DPO) have emerged as attractive alternatives, offering improvements in stability and scalability while maintaining competitive performance. SLiC refines its loss function using sequence pairs sampled from a supervised fine-tuned (SFT) policy, while DPO directly optimizes language models based on preference data, foregoing the need for a separate reward model. However, the maximum likelihood estimator (MLE) of the target optimal policy requires labeled preference pairs sampled from that policy. DPO's lack of a reward model constrains its ability to sample preference pairs from the optimal policy, and SLiC is restricted to sampling preference pairs only from the SFT policy. To address these limitations, we introduce a novel approach called Statistical Rejection Sampling Optimization (RSO) that aims to source preference data from the target optimal policy using rejection sampling, enabling a more accurate estimation of the optimal policy. We also propose a unified framework that enhances the loss functions used in both SLiC and DPO from a preference modeling standpoint. Through extensive experiments across three diverse tasks, we demonstrate that RSO consistently outperforms both SLiC and DPO on evaluations from both Large Language Model (LLM) and human raters.
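The rejection-sampling step at the core of the approach can be pictured as follows: draw candidate responses from the SFT policy, score them with a reward model, and accept each with probability exp((r - r_max) / beta), which tilts the accepted set toward the reward-weighted (estimated optimal) policy. The sketch below is a minimal illustration with assumed helper names (`sft_sample`, `reward`) and placeholder scores, not the paper's implementation.

```python
# Hypothetical sketch of rejection sampling toward an estimated optimal policy:
# candidates come from the SFT policy and survive with probability
# exp((r - r_max) / beta), so accepted samples approximate the reward-tilted policy.
import math
import random


def sft_sample(prompt: str) -> str:
    # Placeholder for sampling a response from the supervised fine-tuned policy.
    return random.choice(["response A", "response B", "response C"])


def reward(prompt: str, response: str) -> float:
    # Placeholder reward model score r(x, y).
    return {"response A": 0.1, "response B": 0.7, "response C": 0.4}[response]


def rejection_sample(prompt: str, beta: float = 0.5, n_candidates: int = 64) -> list:
    candidates = [sft_sample(prompt) for _ in range(n_candidates)]
    rewards = [reward(prompt, y) for y in candidates]
    r_max = max(rewards)
    accepted = []
    for y, r in zip(candidates, rewards):
        # Higher-reward responses are accepted more often, tilting the SFT
        # samples toward the estimated optimal policy.
        if random.random() < math.exp((r - r_max) / beta):
            accepted.append(y)
    return accepted


if __name__ == "__main__":
    print(rejection_sample("Explain rejection sampling in one sentence."))
```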


Score-Based Generative Modeling with Critically-Damped Langevin Diffusion

Dockhorn, Tim, Vahdat, Arash, Kreis, Karsten

arXiv.org Machine Learning

Score-based generative models (SGMs) have demonstrated remarkable synthesis quality. SGMs rely on a diffusion process that gradually perturbs the data towards a tractable distribution, while the generative model learns to denoise. The complexity of this denoising task is, apart from the data distribution itself, uniquely determined by the diffusion process. We argue that current SGMs employ overly simplistic diffusions, leading to unnecessarily complex denoising processes, which limit generative modeling performance. Based on connections to statistical mechanics, we propose a novel critically-damped Langevin diffusion (CLD) and show that CLD-based SGMs achieve superior performance. CLD can be interpreted as running a joint diffusion in an extended space, where the auxiliary variables can be considered "velocities" that are coupled to the data variables as in Hamiltonian dynamics. We derive a novel score matching objective for CLD and show that the model only needs to learn the score function of the conditional distribution of the velocity given data, an easier task than learning scores of the data directly. We also derive a new sampling scheme for efficient synthesis from CLD-based diffusion models. We find that CLD outperforms previous SGMs in synthesis quality for similar network architectures and sampling compute budgets. We show that our novel sampler for CLD significantly outperforms solvers such as Euler-Maruyama. Our framework provides new insights into score-based denoising diffusion models and can be readily used for high-resolution image synthesis. Project page and code: https://nv-tlabs.github.io/CLD-SGM.
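A coupled data-velocity forward diffusion of the kind described above can be sketched with a simple Euler-Maruyama loop: noise is injected only into the velocity, and the data moves solely through its coupling to the velocity. The coefficients and the critical-damping choice Gamma = 2*sqrt(M) below are illustrative assumptions in the spirit of the abstract, not the paper's exact parameterization.

```python
# Minimal sketch of a coupled data-velocity (damped Langevin) forward diffusion.
# The data x is perturbed only through its coupling to the auxiliary velocity v,
# while Gaussian noise enters through v alone.
import numpy as np


def cld_forward(x0, n_steps=1000, dt=1e-3, beta=4.0, mass=0.25, seed=0):
    rng = np.random.default_rng(seed)
    gamma = 2.0 * np.sqrt(mass)          # critical damping (illustrative choice)
    x = np.array(x0, dtype=float)
    v = np.zeros_like(x)                 # auxiliary "velocity" variables
    for _ in range(n_steps):
        # Euler-Maruyama step of the coupled SDE: noise only hits v.
        dx = beta * (v / mass) * dt
        dv = (-beta * x - beta * gamma * v / mass) * dt \
             + np.sqrt(2.0 * gamma * beta * dt) * rng.standard_normal(x.shape)
        x, v = x + dx, v + dv
    return x, v


if __name__ == "__main__":
    xT, vT = cld_forward(np.array([1.0, -0.5]))
    print(xT, vT)
```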


Adversarial Variational Inference and Learning in Markov Random Fields

Li, Chongxuan, Du, Chao, Xu, Kun, Welling, Max, Zhu, Jun, Zhang, Bo

arXiv.org Machine Learning

Markov random fields (MRFs) find applications in a variety of machine learning areas, but inference and learning in such models are challenging in general. In this paper, we propose the Adversarial Variational Inference and Learning (AVIL) algorithm to solve these problems with minimal assumptions about the structure of the MRF. AVIL employs two variational distributions to approximately infer the latent variables and estimate the partition function, respectively. The variational distributions, which are parameterized as neural networks, provide an estimate of the negative log likelihood of the MRF. On one hand, the estimate takes the intuitive form of an approximate contrastive free energy. On the other hand, obtaining it amounts to a minimax optimization problem, which is solved by stochastic gradient descent in an alternating manner. We apply AVIL to various undirected generative models in a fully black-box manner and obtain better results than existing competitors on several real datasets.
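The alternating minimax optimization mentioned above can be pictured as two optimizers taking turns on a shared scalar estimate: one ascends in the partition-function network, the other descends in the model and inference networks. The objective below is a placeholder stand-in, and all names (`nll_estimate`, `train_avil`) are hypothetical; the actual AVIL contrastive free-energy estimate is defined in the paper.

```python
# Hypothetical sketch of alternating minimax training with two variational networks.
# The objective here is a runnable placeholder, not the AVIL free-energy estimate.
import torch
import torch.nn as nn


def nll_estimate(model, infer_net, partition_net, batch):
    # Placeholder for the approximate contrastive free energy.
    return (model(batch) + infer_net(batch) - partition_net(batch)).mean()


def train_avil(model, infer_net, partition_net, loader, steps=10, lr=1e-3):
    opt_min = torch.optim.SGD(list(model.parameters()) + list(infer_net.parameters()), lr=lr)
    opt_max = torch.optim.SGD(partition_net.parameters(), lr=lr)
    data = iter(loader)
    for step in range(steps):
        batch = next(data)
        opt_min.zero_grad()
        opt_max.zero_grad()
        loss = nll_estimate(model, infer_net, partition_net, batch)
        if step % 2 == 0:
            # Ascent step on the partition-function network (maximize the estimate).
            (-loss).backward()
            opt_max.step()
        else:
            # Descent step on the model and inference networks.
            loss.backward()
            opt_min.step()


if __name__ == "__main__":
    make_net = lambda: nn.Sequential(nn.Linear(4, 1))
    loader = [torch.randn(8, 4) for _ in range(20)]
    train_avil(make_net(), make_net(), make_net(), loader)
```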